Deep Learning - The Mathematics Behind Neural Network - Part 2

Note: This blog is part of a learn-along series, so there may be updates and changes as we progress.

In the previous blog, we covered the foundational concepts of neural networks. In this post, we work through the mathematics behind the basic neural network structure illustrated below:

Introduction to Neural Network Structure

  • Input Nodes: n_1 and n_2
  • Hidden Nodes: n_3 to n_8
  • Output Node: n_9

Each node is associated with a bias, denoted b_i, and each synapse (connection between nodes) has a weight, denoted w_i. The initial input values are x_1 and x_2, and \hat{y} represents the output produced by the network (its prediction).

Initializing Inputs, Weights, and Biases

First, let's give the inputs, weights, and biases initial values to better visualize what is happening.

x_1 = 0.23, x_2 = 0.55

w_1 = 0.1, w_2 = 0.2, w_3 = 0.3, w_4 = 0.4, w_5 = 0.5, w_6 = 0.6, w_7 = 0.7, w_8 = 0.8, w_9 = 0.9

w_{10} = 0.1, w_{11} = 0.2, w_{12} = 0.3, w_{13} = 0.4, w_{14} = 0.5, w_{15} = 0.6, w_{16} = 0.7, w_{17} = 0.8, w_{18} = 0.9

b_1 = 0.1, b_2 = -0.4, b_3 = 0.3, b_4 = -0.5, b_5 = 0.5, b_6 = 0.6, b_7 = -0.7
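If you'd like to follow along in code, here is a minimal Python sketch that sets up the same values. Keeping the weights and biases in dictionaries keyed by their indices is just my own bookkeeping choice for this post, not anything standard.

```python
# Inputs
x1, x2 = 0.23, 0.55

# Weights w1..w18 (same values as listed above)
w = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4, 5: 0.5, 6: 0.6, 7: 0.7, 8: 0.8, 9: 0.9,
     10: 0.1, 11: 0.2, 12: 0.3, 13: 0.4, 14: 0.5, 15: 0.6, 16: 0.7, 17: 0.8, 18: 0.9}

# Biases b1..b7
b = {1: 0.1, 2: -0.4, 3: 0.3, 4: -0.5, 5: 0.5, 6: 0.6, 7: -0.7}
```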

Forward Propagation Through Hidden Layers

Now, let's understand how the input values work their way through the neural network. Looking at node n_3 in the hidden layer, we can see that it receives two inputs through its two synapses. As we discussed in the previous blog, a node usually does two things with its inputs: it computes the total net input and then applies an activation function to that total to produce the node's output. Following common practice, we will use ReLU as the activation function for the nodes of the hidden layers and the sigmoid for the node(s) of the output layer.

net_{n_3} = w_1 \cdot x_1 + w_4 \cdot x_2 + b_1

net_{n_3} = 0.1 \cdot 0.23 + 0.4 \cdot 0.55 + 0.1 = 0.343

out_{n_3} = \max(0, net_{n_3})

out_{n_3} = \max(0, 0.343) = 0.343

Here are the outputs for the rest of the nodes in hidden layer 1:

out_{n_4} = \max(0, -0.079) = 0

out_{n_5} = \max(0, 0.699) = 0.699
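Continuing the Python sketch from above, the same layer-1 computations look like this (the commented values are approximate):

```python
def relu(net):
    # ReLU activation: max(0, net)
    return max(0.0, net)

# Total net input and output for each node in hidden layer 1
net_n3 = w[1] * x1 + w[4] * x2 + b[1]   # ~ 0.343
net_n4 = w[2] * x1 + w[5] * x2 + b[2]   # ~ -0.079
net_n5 = w[3] * x1 + w[6] * x2 + b[3]   # ~ 0.699

out_n3, out_n4, out_n5 = relu(net_n3), relu(net_n4), relu(net_n5)   # 0.343, 0.0, 0.699
```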

Now, the outputs of the nodes in hidden layer 1 become the inputs of the nodes in hidden layer 2, as shown in the diagram below.

After repeating the same steps for the nodes in hidden layer 2, we get the following outputs.

out_{n_6} = \max(0, 0.0197) = 0.0197

out_{n_7} = \max(0, 1.1239) = 1.1239

out_{n_8} = \max(0, 1.3281) = 1.3281
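In code, continuing the sketch and assuming the wiring implied by the calculations above (w_7, w_{10}, w_{13} feed n_6 from n_3, n_4, n_5, and so on for n_7 and n_8):

```python
# Hidden layer 2: the outputs of layer 1 are the inputs here
net_n6 = w[7] * out_n3 + w[10] * out_n4 + w[13] * out_n5 + b[4]   # ~ 0.0197
net_n7 = w[8] * out_n3 + w[11] * out_n4 + w[14] * out_n5 + b[5]   # ~ 1.1239
net_n8 = w[9] * out_n3 + w[12] * out_n4 + w[15] * out_n5 + b[6]   # ~ 1.3281

out_n6, out_n7, out_n8 = relu(net_n6), relu(net_n7), relu(net_n8)
```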

Computing Output Layer Activation

As mentioned above, we will be using the Sigmoid function as the activation function for the nodes in the output layer, which in this case is only one node.

net_{n_9} = w_{16} \cdot out_{n_6} + w_{17} \cdot out_{n_7} + w_{18} \cdot out_{n_8} + b_7

net_{n_9} = 1.4082

\sigma(x) = \frac{1}{1 + e^{-x}}

\hat{y} = out_{n_9} = \sigma(net_{n_9}) = \frac{1}{1 + e^{-net_{n_9}}}

\hat{y} = \sigma(1.4082) = \frac{1}{1 + e^{-1.4082}} = 0.8035
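The same computation in the running Python sketch:

```python
import math

def sigmoid(net):
    # Logistic sigmoid: 1 / (1 + e^(-net))
    return 1.0 / (1.0 + math.exp(-net))

net_n9 = w[16] * out_n6 + w[17] * out_n7 + w[18] * out_n8 + b[7]   # ~ 1.4082
y_hat = sigmoid(net_n9)                                            # ~ 0.8035
```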

Calculating Total Error

Our next step is to calculate the total error of the neural network. This can be done using a variety of methods, but we will make use of the squared error with a multiplier of \frac{1}{2} so that the derivative we take later on is cleaner.

E(y, \hat{y}) = \frac{1}{2}\sum (y - \hat{y})^2

Here, y represents the ideal (target) output and \hat{y} represents the actual output of the network. Let's assume that y = 0.01 to continue with our explanation.

E(y, \hat{y}) = \frac{1}{2} (y - \hat{y})^2

E(0.01, 0.8035) = \frac{1}{2} (0.01 - 0.8035)^2 = 0.315

If we had more than one output, we would calculate the error for each output and sum them to get the total error. But since we have only one output, we can say that E_{\text{total}} = 0.315.
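In the running sketch, with the assumed target y = 0.01:

```python
y = 0.01                              # assumed target value from the text
E_total = 0.5 * (y - y_hat) ** 2      # ~ 0.315
```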

Backpropagation and Weight Updates

Next, we perform the backward pass, also known as backpropagation, which updates the weights and biases. This is done to bring the actual output closer to the target output, which reduces the total error in the process. Let's first try to update the weight w_{16}. Before we update it, we must know how much a change in w_{16} affects the total error E_{\text{total}}.

\frac{\partial E_{\text{total}}}{\partial w_{16}}

If we apply the chain rule to \frac{\partial E_{\text{total}}}{\partial w_{16}}, we get:

\frac{\partial E_{\text{total}}}{\partial w_{16}} = \frac{\partial E_{\text{total}}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial w_{16}}

E_{\text{total}} = \frac{1}{2}\sum (y - \hat{y})^2 = \frac{1}{2} (y - out_{n_9})^2

\frac{\partial E_{\text{total}}}{\partial out_{n_9}} = out_{n_9} - y = 0.8035 - 0.01 = 0.7935

out_{n_9} = \frac{1}{1 + e^{-net_{n_9}}}

\frac{\partial out_{n_9}}{\partial net_{n_9}} = out_{n_9}(1 - out_{n_9}) = 0.8035 \cdot (1 - 0.8035) = 0.1579

net_{n_9} = w_{16} \cdot out_{n_6} + w_{17} \cdot out_{n_7} + w_{18} \cdot out_{n_8} + b_7

\frac{\partial net_{n_9}}{\partial w_{16}} = out_{n_6} = 0.0197

Combining these we get:

\frac{\partial E_{\text{total}}}{\partial w_{16}} = (out_{n_9} - y) \cdot out_{n_9} \cdot (1 - out_{n_9}) \cdot out_{n_6}

\frac{\partial E_{\text{total}}}{\partial w_{16}} = 0.7935 \cdot 0.1579 \cdot 0.0197 = 0.0025

Now, we can update the weight w_{16} using the learning rate \eta:

w_{16}^{\text{new}} = w_{16}^{\text{old}} - \eta \cdot \frac{\partial E_{\text{total}}}{\partial w_{16}}

Assuming a learning rate \eta = 0.01:

w_{16}^{\text{new}} = 0.7 - 0.01 \cdot 0.0025 = 0.699975
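Continuing the Python sketch, the three factors and the update for w_{16} look like this. The updated value is kept in a separate variable so the remaining gradients in this post are still computed with the original weights, just as in the text:

```python
eta = 0.01                                      # learning rate

dE_dout9    = y_hat - y                         # ~ 0.7935
dout9_dnet9 = y_hat * (1 - y_hat)               # ~ 0.1579
dnet9_dw16  = out_n6                            # ~ 0.0197

dE_dw16 = dE_dout9 * dout9_dnet9 * dnet9_dw16   # ~ 0.0025
w16_new = w[16] - eta * dE_dw16                 # ~ 0.699975
```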

We can repeat this process for w_{17} and w_{18}:

w_{17}^{\text{new}} = 0.8 - 0.01 \cdot 0.1408 = 0.798592

w_{18}^{\text{new}} = 0.9 - 0.01 \cdot 0.1664 = 0.898336
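In code, the only factor that changes is the last one, the output of the node feeding the synapse:

```python
dE_dw17 = dE_dout9 * dout9_dnet9 * out_n7   # ~ 0.1408
dE_dw18 = dE_dout9 * dout9_dnet9 * out_n8   # ~ 0.1664

w17_new = w[17] - eta * dE_dw17             # ~ 0.798592
w18_new = w[18] - eta * dE_dw18             # ~ 0.898336
```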

Similarly, we update the biases. Let's start with b_7:

\frac{\partial E_{\text{total}}}{\partial b_7} = \frac{\partial E_{\text{total}}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial b_7}

net_{n_9} = w_{16} \cdot out_{n_6} + w_{17} \cdot out_{n_7} + w_{18} \cdot out_{n_8} + b_7

\frac{\partial net_{n_9}}{\partial b_7} = 1

\frac{\partial E_{\text{total}}}{\partial b_7} = 0.7935 \cdot 0.1579 \cdot 1 = 0.1253

b_7^{\text{new}} = b_7^{\text{old}} - \eta \cdot \frac{\partial E_{\text{total}}}{\partial b_7}

b_7^{\text{new}} = -0.7 - 0.01 \cdot 0.1253 = -0.701253
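And in the running sketch:

```python
dE_db7 = dE_dout9 * dout9_dnet9 * 1.0   # ~ 0.1253
b7_new = b[7] - eta * dE_db7            # ~ -0.701253
```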

Iterative Training and Error Reduction

To update the weights and biases in the hidden layers, we need to propagate the error backward from the output layer. We'll start by calculating the partial derivatives for the weights of the synapses going into the second hidden layer, and then move on to the weights of the synapses going into the first hidden layer.

\frac{\partial E_{\text{total}}}{\partial w_7} = \frac{\partial E_{\text{total}}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial out_{n_6}} \cdot \frac{\partial out_{n_6}}{\partial net_{n_6}} \cdot \frac{\partial net_{n_6}}{\partial w_7}

\frac{\partial E_{\text{total}}}{\partial out_{n_9}} = 0.7935

\frac{\partial out_{n_9}}{\partial net_{n_9}} = 0.1579

\frac{\partial net_{n_9}}{\partial out_{n_6}} = w_{16} = 0.7

\frac{\partial out_{n_6}}{\partial net_{n_6}} = \begin{cases} 1 & \text{if } net_{n_6} > 0 \\ 0 & \text{otherwise} \end{cases}

\frac{\partial out_{n_6}}{\partial net_{n_6}} = 1 \text{ (since } net_{n_6} = 0.0197 > 0\text{)}

\frac{\partial net_{n_6}}{\partial w_7} = out_{n_3} = 0.343

\frac{\partial E_{\text{total}}}{\partial w_7} = 0.7935 \cdot 0.1579 \cdot 0.7 \cdot 1 \cdot 0.343 = 0.03008

Now, update the weight w_7 using the learning rate \eta = 0.01:

w_7^{\text{new}} = w_7^{\text{old}} - \eta \cdot \frac{\partial E_{\text{total}}}{\partial w_7}

w_7^{\text{new}} = 0.7 - 0.01 \cdot 0.03008 = 0.6997
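Continuing the sketch, the five factors for w_7 and its update:

```python
dnet9_dout6 = w[16]                          # 0.7
dout6_dnet6 = 1.0 if net_n6 > 0 else 0.0     # 1.0, since net_n6 ~ 0.0197 > 0
dnet6_dw7   = out_n3                         # 0.343

dE_dw7 = dE_dout9 * dout9_dnet9 * dnet9_dout6 * dout6_dnet6 * dnet6_dw7   # ~ 0.03008
w7_new = w[7] - eta * dE_dw7                 # ~ 0.6997
```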

Similarly, for w_1 we apply the chain rule. Keep in mind that out_{n_3} feeds into n_6, n_7 and n_8, so the full derivative sums the contributions from all three paths; the term for the path through n_6 is:

\frac{\partial E_{\text{total}}}{\partial w_1} = \frac{\partial E_{\text{total}}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial out_{n_6}} \cdot \frac{\partial out_{n_6}}{\partial net_{n_6}} \cdot \frac{\partial net_{n_6}}{\partial out_{n_3}} \cdot \frac{\partial out_{n_3}}{\partial net_{n_3}} \cdot \frac{\partial net_{n_3}}{\partial w_1}
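A sketch of that three-path sum, continuing the code above (the intermediate delta names are my own shorthand for \partial E / \partial net at each node):

```python
delta9 = dE_dout9 * dout9_dnet9                          # dE/dnet_n9
delta6 = delta9 * w[16] * (1.0 if net_n6 > 0 else 0.0)   # dE/dnet_n6
delta7 = delta9 * w[17] * (1.0 if net_n7 > 0 else 0.0)   # dE/dnet_n7
delta8 = delta9 * w[18] * (1.0 if net_n8 > 0 else 0.0)   # dE/dnet_n8

# Sum the contributions of the three downstream paths, then finish the chain down to w1
dE_dout3 = delta6 * w[7] + delta7 * w[8] + delta8 * w[9]
dE_dw1   = dE_dout3 * (1.0 if net_n3 > 0 else 0.0) * x1
w1_new   = w[1] - eta * dE_dw1
```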

By iterating this process (training), the total error decreases, and the neural network improves its task performance.
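To make that iteration concrete, here is a small, self-contained NumPy sketch of the same 2-3-3-1 network trained by repeating the forward pass, backward pass, and update steps described in this post. The matrix layout and variable names are my own choices for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same network as in this post, written with one weight matrix per layer.
x = np.array([0.23, 0.55])
y = 0.01
W1 = np.array([[0.1, 0.2, 0.3],             # columns: n3, n4, n5
               [0.4, 0.5, 0.6]])
W2 = np.array([[0.7, 0.8, 0.9],             # rows: n3, n4, n5; columns: n6, n7, n8
               [0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])
W3 = np.array([0.7, 0.8, 0.9])              # n6, n7, n8 -> n9
b1 = np.array([0.1, -0.4, 0.3])
b2 = np.array([-0.5, 0.5, 0.6])
b3 = -0.7
eta = 0.01

for step in range(1000):
    # Forward pass
    z1 = x @ W1 + b1
    h1 = relu(z1)                           # hidden layer 1 (n3, n4, n5)
    z2 = h1 @ W2 + b2
    h2 = relu(z2)                           # hidden layer 2 (n6, n7, n8)
    z3 = h2 @ W3 + b3
    y_hat = sigmoid(z3)                     # output node n9

    # Backward pass: dE/dnet for each layer
    d3 = (y_hat - y) * y_hat * (1 - y_hat)
    d2 = d3 * W3 * relu_grad(z2)
    d1 = (W2 @ d2) * relu_grad(z1)

    # Gradient-descent updates
    W3 -= eta * d3 * h2
    b3 -= eta * d3
    W2 -= eta * np.outer(h1, d2)
    b2 -= eta * d2
    W1 -= eta * np.outer(x, d1)
    b1 -= eta * d1

print(y_hat, 0.5 * (y - y_hat) ** 2)        # the error shrinks as y_hat moves toward y
```

Running this for more iterations, or with a larger learning rate, should push \hat{y} further toward the target and the error closer to zero.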

In this post, we've covered the mathematics behind a basic neural network, focusing on how the inputs, weights, and biases interact to produce the final output. We've walked through the process of forward propagation, calculating the output of each node, and applied the backpropagation algorithm to update the weights and biases, reducing the total error.

Until next time, signing off.